Assignment 1

Author

Isabella Villanueva

Exploratory Data Analysis of PM2.5 in California from 2002 to 2022

What is PM2.5?

PM or particulate matter is a term for a mixture of solid particles and liquid droplets found in the air. Some particles, such as dust, dirt, soot, or smoke, are large or dark enough to be seen with the naked eye. Others are so small they can only be detected using an electron microscope (United States Environmental Protection Agency (EPA)).

PM2.5: fine inhalable particles, with diameters that are generally 2.5 micrometers and smaller (EPA).

Summary of Data:

Checking the dimensions of the two data sets shows that 2022 has far more rows (daily readings across the year) than 2002: 59756 versus 15976. This difference could affect the means calculated for each year, and it raises the question of whether the two years are directly comparable given how much more densely readings were taken in 2022. Both data sets do have the same number of variables (22), which helps with comparability.
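The dimension comparison described above can be reproduced along these lines. This is a minimal sketch on two small made-up data frames (`toy_2002` and `toy_2022` are hypothetical stand-ins, not the real files); with the actual data one would pass `pmdata2002` and `pmdata2022` instead.

```r
# Hypothetical stand-ins for the two yearly tables (values are made up)
toy_2002 <- data.frame(Date = c("01/01/2002", "01/04/2002"),
                       pm25 = c(12.1, 8.4))
toy_2022 <- data.frame(Date = c("01/01/2022", "01/02/2022", "01/03/2022"),
                       pm25 = c(5.0, 6.2, 4.9))

# Rows and columns for each year, side by side
dims <- rbind(`2002` = dim(toy_2002), `2022` = dim(toy_2022))
colnames(dims) <- c("rows", "cols")
print(dims)

# TRUE when both tables carry the same variables in the same order
same_vars <- identical(names(toy_2002), names(toy_2022))
print(same_vars)
```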

Found in the 2002 table “pmdata2002” with the following code: tail(pmdata2002)

One value on 12/22/2002, a daily mean PM2.5 concentration of 1 ug/m3, initially concerned me. After searching the data, however, I found that readings even below 1.0 are common throughout 2002, which dismissed my initial concern.
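One way to check how common such low readings are is to count them directly. The sketch below uses a short made-up vector (`pm25` is hypothetical); on the real table the same expressions could be applied to the daily mean column of `pmdata2002`.

```r
# Hypothetical daily means (made-up values, not the real 2002 readings)
pm25 <- c(0.4, 1.0, 3.2, 0.9, 15.6, 0.0, 7.8)

# Number of readings at or below 1 ug/m3
n_low <- sum(pm25 <= 1, na.rm = TRUE)

# Share of readings at or below 1 ug/m3
share_low <- mean(pm25 <= 1, na.rm = TRUE)

print(n_low)
print(share_low)
```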

Combine 2002 and 2022 Data into One Data Frame, then Create New Column that Signifies Year

library(readr)
pmdata2002 <- read_csv("~/Downloads/pmdata2002.csv")
Rows: 15976 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Date, Source, Site ID, Units, Local Site Name, AQS Parameter Descr...
dbl (10): POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Co...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pmdata2022 <- read_csv("~/Downloads/pmdata2022.csv")
Rows: 59756 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Date, Source, Site ID, Units, Local Site Name, AQS Parameter Descr...
dbl (10): POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Co...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pmdata2002$Year <- 2002
pmdata2022$Year <- 2022
combined_data <- rbind(pmdata2002, pmdata2022)

Rename Key Variables

names(combined_data)[names(combined_data) == "Daily.Mean.PM2.5.Concentration"] <- "Daily_PM2.5"
names(combined_data)[names(combined_data) == "Daily.AQI.Value"] <- "Daily_AQI"
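The column specification printed by read_csv() above lists the names with spaces (for example `Daily Mean PM2.5 Concentration`), so a match on the dotted spelling may find nothing and leave the names unchanged. That would also be consistent with the later code still referring to the original names. The sketch below illustrates the difference on a hypothetical one-row table (`toy` and `renamed` are illustrative names); if the rename were changed to match, later references would need the new names too.

```r
# Hypothetical one-row table; check.names = FALSE keeps the spaces in the
# column names, which is how read_csv() returns them
toy <- data.frame(`Daily Mean PM2.5 Concentration` = 10.2,
                  `Daily AQI Value` = 42,
                  check.names = FALSE)

# Matching on the dotted name finds nothing, so the names stay unchanged
names(toy)[names(toy) == "Daily.Mean.PM2.5.Concentration"] <- "Daily_PM2.5"
dotted_matched <- "Daily_PM2.5" %in% names(toy)

# Matching on the name with spaces performs the rename
renamed <- toy
names(renamed)[names(renamed) == "Daily Mean PM2.5 Concentration"] <- "Daily_PM2.5"
spaced_matched <- "Daily_PM2.5" %in% names(renamed)

print(c(dotted = dotted_matched, spaced = spaced_matched))
```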

Map Displaying Site Locations

library(data.table)
library(leaflet)
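# Keep Year so each site can be colored by the year it reported in
pm_stations <- unique(combined_data[, c("Site Latitude", "Site Longitude", "Year")])
year_pal <- colorFactor(c('pink', 'blue'), domain = pm_stations$Year)

leaflet(pm_stations) |> 
  addProviderTiles('CartoDB.Positron') |> 
  addCircles(lat = ~`Site Latitude`, lng = ~`Site Longitude`,
             opacity = 1, fillOpacity = 1, radius = 400, color = ~year_pal(Year))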

Check for any missing values of PM2.5 in the combined dataset

missing_values <- sum(is.na(combined_data$`Daily Mean PM2.5 Concentration`))
cat("Total missing values in PM~2.5~", missing_values, "\n")
Total missing values in PM~2.5~ 0 

Check for implausible values

implausible_values <- sum(combined_data$`Daily Mean PM2.5 Concentration` < 0, na.rm = TRUE)
cat("Total implausible values in PM~2.5~", implausible_values, "\n")
Total implausible values in PM~2.5~ 215 

Explore Proportions of missing and implausible values

total_observations <- nrow(combined_data)
missing_proportion <- missing_values / total_observations
implausible_proportion <- implausible_values / total_observations
cat("Proportion of missing values in PM~2.5~ concentrations:", missing_proportion, "\n")
Proportion of missing values in PM~2.5~ concentrations: 0 
cat("Proportion of implausible values in PM~2.5~ concentrations:", implausible_proportion, "\n")
Proportion of implausible values in PM~2.5~ concentrations: 0.002838958 

Summarize Patterns of These Observations

When checking for missing values of PM2.5 concentrations, zero values are missing, most likely due to previous data cleaning.

When checking for implausible values (defined here as negative PM2.5 concentrations, which cannot physically occur), 215 values were found, about 0.28% of all observations. These readings can likely be attributed to measurement-instrument error, user error, or other technical error.
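To see whether the negative readings cluster in one year, they can be counted within each year. This is a minimal sketch on made-up data (`toy` and its `pm25` column are hypothetical); the same idea could be applied to `combined_data` using its `Year` column and the daily mean column.

```r
# Hypothetical readings for two years (values are made up)
toy <- data.frame(
  Year = c(2002, 2002, 2022, 2022, 2022),
  pm25 = c(4.1, 12.0, -1.3, 6.5, -0.2)
)

# Flag negative readings, then count them within each year
toy$negative <- toy$pm25 < 0
neg_by_year <- tapply(toy$negative, toy$Year, sum)

# Share of negative readings within each year
share_by_year <- tapply(toy$negative, toy$Year, mean)

print(neg_by_year)
print(share_by_year)
```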

Exploratory Graphs

State-Level Data by Year

boxplot(combined_data$`Daily Mean PM2.5 Concentration` ~ combined_data$Year)
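The boxplot can be paired with a small numeric table of the same comparison. The sketch below uses made-up values; `toy` and the helper `year_stats` are hypothetical names introduced only for illustration.

```r
# Hypothetical readings for two years (values are made up)
toy <- data.frame(Year = c(2002, 2002, 2002, 2022, 2022, 2022),
                  pm25 = c(10, 20, 30, 5, 8, 50))

# Illustrative helper: a few summary statistics for one group of readings
year_stats <- function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x))
}

# One column per year, one row per statistic
stats_by_year <- sapply(split(toy$pm25, toy$Year), year_stats)
print(stats_by_year)
```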

County-Level Data by Year

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()     masks data.table::between()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::first()       masks data.table::first()
✖ lubridate::hour()    masks data.table::hour()
✖ lubridate::isoweek() masks data.table::isoweek()
✖ dplyr::lag()         masks stats::lag()
✖ dplyr::last()        masks data.table::last()
✖ lubridate::mday()    masks data.table::mday()
✖ lubridate::minute()  masks data.table::minute()
✖ lubridate::month()   masks data.table::month()
✖ lubridate::quarter() masks data.table::quarter()
✖ lubridate::second()  masks data.table::second()
✖ purrr::transpose()   masks data.table::transpose()
✖ lubridate::wday()    masks data.table::wday()
✖ lubridate::week()    masks data.table::week()
✖ lubridate::yday()    masks data.table::yday()
✖ lubridate::year()    masks data.table::year()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
combined_data |> 
  filter(!is.na(County) & !is.na(Year)) |> 
  ggplot() + 
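  # Date is read in as character (e.g. "12/22/2002"), so parse it for a true time axis
  geom_point(mapping = aes(x = as.Date(Date, format = "%m/%d/%Y"), y = `Daily Mean PM2.5 Concentration`)) + 
  facet_grid(County ~ Year, scales = "free") +
  labs(x = "Date")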
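With many counties the faceted plot can be hard to read, so a table of county means per year may be a useful companion. This is a sketch on made-up data; `toy`, its `pm25` column, and the county values shown are placeholders rather than results from the real data set.

```r
# Hypothetical county-level readings (values are made up)
toy <- data.frame(
  County = c("Alameda", "Alameda", "Kern", "Kern", "Alameda", "Kern"),
  Year   = c(2002, 2002, 2002, 2002, 2022, 2022),
  pm25   = c(10, 14, 20, 30, 8, 12)
)

# Mean reading for every County/Year combination
county_means <- aggregate(pm25 ~ County + Year, data = toy, FUN = mean)
print(county_means)
```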

One Site in Los Angeles Data

The following analysis evaluates the Los Angeles location named ‘North Main Street Station’.

library(dplyr)
# Keep only the rows for the Los Angeles-North Main Street site
LA_site <- combined_data %>%
  filter(`Local Site Name` == "Los Angeles-North Main Street")

# Plot daily readings by date, one panel per year
LA_site |>
  ggplot() +
  geom_point(mapping = aes(x = as.Date(Date, format = "%m/%d/%Y"), y = `Daily Mean PM2.5 Concentration`), color = "lightblue") +
  facet_wrap(~ Year, nrow = 1, scales = "free_x") +
  labs(title = "Daily PM2.5 Concentration for Los Angeles-North Main Street Station, 2002 vs. 2022",
       x = "Date", y = "Daily PM2.5 Concentration (µg/m^3)")

We can notice a large disparity in the variability of the data between 2002 and 2022: 2022 has higher readings of daily PM2.5 concentrations, while the 2002 data are more condensed, with readings that range from 0 to roughly 105.

This is consistent with the state- and county-level plots shown earlier; focusing on this one location in Los Angeles gives a more specific look at the disparity in variability.
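The spread at a single site can also be checked numerically by taking the range of readings within each year. The sketch below uses made-up values; `toy`, `site`, and `pm25` are hypothetical names standing in for `combined_data`, `Local Site Name`, and the daily mean column.

```r
# Hypothetical readings for two sites (values are made up)
toy <- data.frame(
  site = c(rep("Los Angeles-North Main Street", 4), rep("Other Site", 2)),
  Year = c(2002, 2002, 2022, 2022, 2002, 2022),
  pm25 = c(9, 31, 4, 18, 50, 60)
)

# Keep only the site of interest
la <- subset(toy, site == "Los Angeles-North Main Street")

# Minimum and maximum reading within each year (one column per year)
range_by_year <- sapply(split(la$pm25, la$Year), range)
rownames(range_by_year) <- c("min", "max")
print(range_by_year)
```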